Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migration of first IDataView docs #173

Merged
merged 5 commits into from
May 22, 2018

Conversation

TomFinley
Copy link
Contributor

Migration of some existing internal documentation, rephrased in some places to be more appropriate in context (hopefully successfully). Related to #160, though this PR would be just part of addressing the issue of moving over internal docs.

@dnfclas
Copy link

dnfclas commented May 16, 2018

CLA assistant check
All CLA requirements met.

@TomFinley TomFinley changed the title Tfinley/code docs1 Migration of first IDataView docs May 16, 2018
@TomFinley
Copy link
Contributor Author

TomFinley commented May 16, 2018

Incidentally I am putting this in a docs folder, anticipating that #87 will go through, though that one is taking longer than I might hope. #Closed

The reality cannot be seen by any conventional means I am aware of, save for
viewing ML.NET's workings in the debugger or using the API and inspecting
these raw values yourself: that `4000` you would see is really stored as the
`byte` `1`, `4001` as `2`, `4002` as `3`, and the missing `�` stored as `0`.
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a unicode replacement char '�' in this line. I'm assuming this is intentional, though perhaps there's another way to convey the missing key value in a way which doesn't remind me of Mojibake.
#Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm OK. A verbal description ought to be enough.


In reply to: 188808988 [](ancestors = 188808988)

nonetheless quite valuable to know. That is, not the `IDataView` spec itself,
but many of the logical implications of that spec.

We will here starts with the idioms and practices for `IDataView` generally,
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"We will here starts" should be "We will here start" #Resolved

Presumably you are motivated to read this document because you have some
problem of how to get some data into ML.NET, or process data using ML.NET, or
something along these lines. There is a decision to be made about how to even
engineer a solution. Sometimes its quite obvious: text featurization obviously
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Sometimes its quite" should be "Sometimes it's quite" #Resolved

use of LEB128 in places where we are writing values that, on balance, we
expect to be relatively small, and only in cases where there is no potential
for benefit for random access to the associated stream, since LEB128 is
incompatible with random access. However this is not formulated into anything
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"However this is not" should be "However, this is not" #Resolved


Note that identically *constructed* data views are not necessarily
*functionally* identical. Consider this usage of the train and score transform
with `xf=trainScore{tr=ap}`, where we first train averaged perceptron, then
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend UpperCamel TrainScore as this is the case I see most often. Also TrainScore is not currently available in ML.NET (or xf=). And, spelling out tr=AveragedPerceptron may be easier for readers to understand. Granted case and long vs. short is up to personal preference. #Pending

Copy link
Contributor Author

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh... I thought I'd excised all maml command line things. But now I remember, I saw this, threw up my hands, left it for later, then promptly forgot about it till now. This might require a bit more thought.

Unfortunately I don't know that there's a way to express this in the new public API of ML.NET though. Hmmm. We also haven't migrated some things like the markdown saver. Hmmm.


In reply to: 188811867 [](ancestors = 188811867)

Copy link
Contributor Author

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I just don't know what to do about this. A verbal description of what would happen if it were possible to do this is fairly feeble.


In reply to: 188843730 [](ancestors = 188843730,188811867)

Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment out the section until which point we have a TrainScore? Or perhaps leave behind in another file until we can move it back here? #Pending

Copy link
Contributor Author

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. Let me think about this. Let me know if you have other ideas.


In reply to: 188877540 [](ancestors = 188877540)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm afraid I have to punt on this one... there's kind of no replacement for this yet. It's kind of minor, we can adjust later I hope.


In reply to: 189015019 [](ancestors = 189015019,188877540)

Moreover: it is a solution to a problem that does not exist. `IDataView`s are
fundamentally composable structures already, and one of the most fundamental
operations you can do is transform columns into different types. So, there is
no need for you to do the conversion yourself. Indeed it is harmful for you to
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Indeed it is harmful" should be "Indeed, it is harmful" #Resolved


* The `DimCount` property indicates the number of dimensions and the `GetDim`
method returns a particular dimension value. All dimension values are non-
negative integers. A zero dimension value indicates unknown (or variable) in
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severely minor: recommend, "A zero-dimension value " #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think perhaps this is a misparse. It's "zero (dimension value)". I can rephrase though, since "2-dimension" and such things are common constructs, and it does resemble that...


In reply to: 188812830 [](ancestors = 188812830)

* `U1`, `U2`, `U4`, `U8`: unsigned integer types with the indicated number of
bytes

* `UG`: unsigned type with 16-bytes, typically used as an unique ID
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"as an unique" should be "as a unique"

... yes I had to look this one up to make sure I'm not crazy: https://www.quora.com/Which-is-the-correct-grammar-usage-a-unique-or-an-unique/answer/Stephanie-Fysh?share=1 #Resolved

than the number of bytes in the source's underlying type, or the `Count`
value is positive. In the latter case, the `Count` is necessarily less than
`2^^k`, where `k` is the number of bits in the destination type's underlying
type. For example, `U1[1-*] `can be converted to `U2[1-*]`, but `U2[1-*]`
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`U1[1-*] `can be converted

has the space on the wrong side of the back tick. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first read this, my brain somehow filled in "traintrack" for "back tick." I imagined a poor little space, on the wrong side of the backtick, suddenly finding itself set upon by the unsavory characters in the KeyType gang...


In reply to: 188813760 [](ancestors = 188813760)

Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you think... #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Justin, just clarifying, was there a comment attached here? I've reviewed the language in the section about up-conversion, it seems correct?


In reply to: 188877915 [](ancestors = 188877915)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh oh oh... this was a comment on the earlier issue. Geez CodeFlow github integration sure isn't great at tracking replies. :P


In reply to: 188963444 [](ancestors = 188963444,188877915)

column across multiple rows. Block format is dictated by a codec. There is a
table-of-contents and lookup table to facilitate quasi-random access to
particular blocks. (Quasi in the sense that looking up the value for a column
and particular row may require .)
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfinished thought #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just trying to treat everyone to an of course moment. Yes, that will do.


In reply to: 188814295 [](ancestors = 188814295)

use of LEB128 in places where we are writing values that, on balance, we
expect to be relatively small, and only in cases where there is no potential
for benefit for random access to the associated stream, since LEB128 is
incompatible with random access. However this is not formulated into anything
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"However this" should be "However, this" #Resolved

8 | ulong | **Version**: Indicates the version of the data file.
16 | ulong | **CompatibleVersion**: Indicates the minimum reader version that can interpret this file, possibly with some data loss.
24 | long | **TableOfContentsOffset**: The offset to the column table of contents structure.
32 | long | **TailOffset**: The eight-byte tail signature starts at this offset. So, the entire dataset stream should be considered to have eight plus this value bytes.
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"this value bytes" should be (something like) "this value in bytes" #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually grammatically correct but, I agree, slightly confusing. I think you were parsing it as "(eight) plus (this value bytes)", but the correct interpretation is "(eight plus this value) bytes". I have changed it though, to say it has "byte length of eight plus this value," to remove any misunderstandings.


In reply to: 188814633 [](ancestors = 188814633)


The enum for compression kind is one byte, and follows this scheme:

Compresion Kind | Code
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Compresion" should be "Compression " #Resolved

Zlib (i.e., [RFC1950](http://www.ietf.org/rfc/rfc1950.txt)) | 2

None means no compression. DEFLATE is the default scheme. There is a tendency
to conflate Zlib and DEFLATE, so to be clear: Zlib can be (somewhat inexactly)
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"zlib" is generally fully lowercase. #Resolved

None means no compression. DEFLATE is the default scheme. There is a tendency
to conflate Zlib and DEFLATE, so to be clear: Zlib can be (somewhat inexactly)
considered a wrapped version of DEFLATE, but it is still a distinct (but
closely related) format. However both are implemented by the Zlib library,
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"However both are" should be "However, both are" #Resolved

feature vector to be a vector of floating point values and *not* keys, in
typical usage the majority of usages of keys is as some sort of intermediate
value on the way to that final feature vector. (Unless, say, doing something
like preparing labels for a multiclass learner or somesuch.)
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"somesuch" should be "some such" #Resolved

Copy link
Contributor Author

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? Hmmm.

Anyway I think I can delete it, it's a bit too folksy of a colloquialism and the "say" already suggests that this is merely one example and there's no need to belabor the point.


In reply to: 188815291 [](ancestors = 188815291)

implementation side, which is both less for us to maintain, and also for users
gives consistency in behavior.

So for example, the `charTokenize` above might appear to be a strange choice:
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend UpperCamel for CharTokenize #Resolved

gives consistency in behavior.

So for example, the `charTokenize` above might appear to be a strange choice:
*why* represent characters as keys? The reason is that the N-gram transform is
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a few spelling of "ngram" in the docs. Personally, I like "ngram" and "chargram". #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let me know what you think


In reply to: 188816009 [](ancestors = 188816009)

and so forth.

There is another implication: a hypothetical type `U1<4000-4002>` is actually
a sensible type in this scheme. The `U1` indicates that is is stored in one
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"that is is stored" should be "that is stored" #Resolved

representations, since they are equivalent it follows that any code consuming
`VBuffer`s must work equally well with *both*. That is, there must never be a
condition where data is read and assumed to be either sparse, or dense, since
implementors of `IDataView` and related interfaces are perfectly free to
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"implementors" is less common of a spelling than "implementers" #Resolved

One possible alternate (wrong) implementation of this would be to just say
`dst=src` then scale all contents of `dst.Values` by `c`. But, then `dst` and
`src` would share references to their internal arrays, completely compromising
the callers ability to do anything useful with them: if the caller were to
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"callers" should be "caller's" #Resolved

representations, since they are equivalent it follows that any code consuming
`VBuffer`s must work equally well with *both*. That is, there must never be a
condition where data is read and assumed to be either sparse, or dense, since
implementors of `IDataView` and related interfaces are perfectly free to
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"implementors" is a less common spelling of "implementers" #Resolved

operate on acceptable sets.

1. The simple enumeration of `UInt128` numeric values from any number is an
acceptable set. (This covers how most loaders generate IDs. Typically we
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Typically we" should be "Typically, we" #Resolved

optional metadata.

Column names are case sensitive. Multiple columns can share the same name, in
which case, one of the columns hides the others, in the sense that the name
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to say which column hides the others. Perhaps, "the highest index numbered". #Resolved

Copy link
Contributor Author

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. While this is part of the existing implementations, it isn't really part of the spec or ISchema interface. If you were to construct a dataview with columns ["A", "B", "B"], but the schema implementation behaves so that TryGetColumnIndex("B", out int colIdx)" succeeds with colIndex==1`, it would behave perfectly in, say, a transform pipeline and whatnot. I will though add a note about this in the implementation -- actually maybe a bit more, since I see that I had neglected to mention anything about the schema in the implementation notes aside from one or two notes about metadata...


In reply to: 188891796 [](ancestors = 188891796)

* Unsigned 16 byte values for ids and probabilistically unique hashes
* Date time, date time zone, and timespan
* Key types
* Vector types
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend "Vector types—for the above base types" #Resolved

Copy link
Contributor Author

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. If it were that simple I might do that, but it's actually more flexible than that. Vector types over any so-called primitive type, including user defined types, are perfectly all right. But if I were to do that here I would have to define the concept of a primitive type. We already have a vector section though that goes into this in somewhat more detail. I'd prefer to leave that detail there.


In reply to: 188892678 [](ancestors = 188892678)

* Key types
* Vector types

The set of standard types will likely be expanded over time.
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call out some notably (currently) missing types? eg: half-precision floats, mixed types, decimal, tensors. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I'm inclined to not do anything we don't have a definite plan for, if that's all right.

FYI the "tensor" type is a multidim vector, which is already covered as we see for.


In reply to: 188896792 [](ancestors = 188896792)

Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I released the tensor was wrong as I was re-passing the longer datetype discussion in IDataViewTypeSystem.md and forgot to cleanup my comment.

Not listing the missing types is good too; just ideas. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes like I say I'm wary of listing them, since it could easily be interpreted as a definite future plan. Mentioning that we're adding more IDataView components, sure, but more types -- eh.


In reply to: 189032170 [](ancestors = 189032170)

* Unsigned integer values using 1, 2, 4, or 8 bytes
* Unsigned 16 byte values for ids and probabilistically unique hashes
* Date time, date time zone, and timespan
* Key types
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the list of supported column types, should we include, "User defined types"? For example you could pass a custom datatype between two transforms. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is not a list of "supported" column types, it is a list of "standard" types. (You'll note that we already talked about user defined types in the area preceding this list.)


In reply to: 188897498 [](ancestors = 188897498)

individual slots of a vector-valued column, values associated with a key type
column, whether a column's data is normalized, etc.

While IDataView schema supports an arbitrary number of columns, it, like most
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the limits of the "arbitrary" number of columns? int.MaxValue? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's arbitrary in the sense that the schema is not fixed, not arbitrary in the sense of completely unlimited and unbounded.


In reply to: 188898484 [](ancestors = 188898484)


Machine learning and advanced analytics applications often involve high-
dimensional data. For example, a common technique for learning from text,
known as bag-of-words, represents each word in the text as a numeric feature
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful to our users to link Wikipedia-terms like bag-of-words, or more of a distraction? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah actually, especially since the point of this paragraph seems to be to explain the concept. Wouldn't hurt to have a backup description at hand in case this brief description is insufficient.


In reply to: 188899313 [](ancestors = 188899313)

that maps a text value to the sequence of individual terms in that text,
naturally produces variable-length vectors of text. Then, a hashing ngram
transform may map the variable-length vectors of text to a bag-of-ngrams
representation, which naturally produces numeric vectors of length $2^k$, where
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think math renders in github. Could hot-link to a service such as:

src: <img src="https://latex.codecogs.com/gif.latex?$2^k$" title="$2^k$" />

Though we'd have to investigate which services are suitable for use; though the more appropriate method maybe creating an img folder and storing the prerendered. That option is hard to edit though. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple things like this, I'd rather just format them differently.

What I've seen other people do is format this like 2^^k or something. (Not sure why ^^... Python does **, everyone knows Python, I'm not sure what language ^^ is from.)


In reply to: 188901632 [](ancestors = 188901632)

Copy link
Contributor Author

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW don't search for "latex image service" at work or with kids in the room... trust me on this. :P


In reply to: 188975188 [](ancestors = 188975188,188901632)

examples of this throughout the codebase.

Nevertheless: in very specific circumstances we have relaxed this. For
example, the TLC API serves up corrupt `IDataView` implementations that have
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "TLC" remain in the docs? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh no.


In reply to: 188904967 [](ancestors = 188904967)

savers, trainers, predictors, etc.). The team is actively working on many
more.

The name IDataView was inspired from the database world, where the term table
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is also in IDataViewDesignPrinciples.md #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's deliberate. This is kind of spec-ish whitepaper-ish.


In reply to: 188909199 [](ancestors = 188909199)

Copy link
Contributor

@shauheen shauheen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

For example:

* A column may have a `BL` valued piece of metadata associated with the string
`IsNormalized` indicating whether the column can be interpreted as a label.
Copy link
Contributor

@justinormont justinormont May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this example confusing. Are we saying a user named the column IsNormalized, or is the name of the metadata IsNormalized. Why would either of these tell if the column can be a label? (perhaps regression for linear models?)

Perhaps: "* A column may have a BL valued piece of metadata (along with its associated string IsNormalized to name it) indicating the column has already been normalized, and won't require additional normalization before certain learners or transforms." #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire sentence is absolute nonsense so I'm just going to rewrite it. Normalization has nothing to do with labels. Also I'm not sure why we're saying "Metadata is associated with the string..." Metadata just has names, period.

Hello, I am associated with the string Tom Finley. :D :D


In reply to: 189046936 [](ancestors = 189046936)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you great person associated with the string, "Tom Finley"!


* A column produced by a scorer may have several pieces of associated
metadata, indicating the "scoring column group id" that it belongs to, what
kind of scorer produced the column (e.g., binary classification), and the
Copy link
Contributor

@justinormont justinormont May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see 9 instances of double spaces in this file (and perhaps more at the EOL boundary). eg: "scorer · · produced" #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think I've successfully regular expressioned my way around weeding out these problems.


In reply to: 189163935 [](ancestors = 189163935)

a column of a custom (non-standard) type, `Picture<*,*,4>`, where the
asterisks indicate that the picture dimensions are unknown. The last dimension
of `4` indicates that there are four channels in each pixel: the three color
components, plus the alpha channel. Applying a `BitmapScaler` transform scales
Copy link
Contributor

@justinormont justinormont May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we deprecated BitmapScaler for ImageResizer (and similar for the rest of the image transforms. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right... and image loader transform and so on, and so on.


In reply to: 189171796 [](ancestors = 189171796)

* All numbers are stored as little-endian, using their natural fix-length
binary encoding.

* Strings are stored using an unsigned LEB128 number describing the number of
Copy link
Contributor

@justinormont justinormont May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend moving the LEB128 wiki link to the first mention, as I just looked up LEB128 before getting to the wiki link in the next paragraph. #Resolved


A `VBuffer<T>` is a generic type that supports both dense and sparse vectors
over items of type `T`. This is the representation type for all
[`VectorType`](../public/IDataViewTypeSystem.md#vector-representations)
Copy link
Contributor

@justinormont justinormont May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broken link. Should be: [`VectorType`](IDataViewTypeSystem.md#vector-representations) #Resolved

Regarding the generic type parameter `T`, the only real assumption made about
this type is that assignment (that is, using `=`) is sufficient to create an
*independent* copy of that item. All representation types of the
[primitive types](../public/IDataViewTypeSystem.md#standard-column-types) have
Copy link
Contributor

@justinormont justinormont May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broken link. Should be: [primitive types](IDataViewTypeSystem.md#standard-column-types) #Resolved

Copy link
Contributor

@justinormont justinormont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've finished my passes over the docs. Besides the above newly added (6) comments, it looks good to me.
The only other unresolved topic was n-gram vs. ngram, which I think we should punt on.

Copy link
Contributor

@shauheen shauheen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@justinormont justinormont merged commit 3012428 into dotnet:master May 22, 2018
@TomFinley TomFinley deleted the tfinley/CodeDocs1 branch May 29, 2018 21:55
eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
Migration of some existing internal documentation, rephrased in some places to be more appropriate in context (hopefully successfully). Related to dotnet#160, though this PR would be just part of addressing the issue of moving over internal docs.
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants